Homework 2

hw2

The second homework

Author

Lindsay Jones

Published

October 17, 2022

Setup

Code

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population. Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

Angiography

Code

a_mean <- 18
a_sd <- 9
a_ss <- 847

a_se <- a_sd/sqrt(a_ss)

a_cl <- 0.90  
a_tail <- (1-a_cl)/2
a_tscore <- qt(p = 1-a_tail, df = a_ss-1)

a_ci <- c(a_mean - a_tscore * a_se,
        a_mean + a_tscore * a_se)
print(a_ci)

[1] 17.49078 18.50922

Bypass

Code

b_mean <- 19
b_sd <- 10
b_ss <- 539

b_se <- b_sd/sqrt(b_ss)

b_cl <- 0.90  
b_tail <- (1-b_cl)/2
b_tscore <- qt(p = 1-b_tail, df = b_ss-1)

b_ci <- c(b_mean - b_tscore * b_se,
        b_mean + b_tscore * b_se)
print(b_ci)

[1] 18.29029 19.70971

Code

#assessing which confidence interval is larger
18.50922 - 17.49078

[1] 1.01844

Code

19.70971 - 18.29029

[1] 1.41942

The confidence interval is narrower for angiography: its width is about 1.02 days, compared with about 1.42 days for bypass surgery.
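The angiography and bypass calculations follow the same steps, so they can be wrapped in a single helper. The following is a minimal sketch, not the method required by the assignment; the function name `t_interval` and its argument names are hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not part of the original homework):
# t-based confidence interval computed from summary statistics.
t_interval <- function(xbar, s, n, conf_level = 0.90) {
  se <- s / sqrt(n)                         # standard error of the mean
  tail <- (1 - conf_level) / 2              # area in each tail
  t_crit <- qt(p = 1 - tail, df = n - 1)    # critical t value
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

a_ci <- t_interval(xbar = 18, s = 9, n = 847)    # angiography
b_ci <- t_interval(xbar = 19, s = 10, n = 539)   # bypass

# Interval widths, taken from the stored intervals instead of retyped numbers
a_width <- unname(diff(a_ci))
b_width <- unname(diff(b_ci))
a_width < b_width   # TRUE when the angiography interval is narrower
```

Computing the widths with `diff()` avoids copying rounded values by hand, which is where transcription errors tend to creep in.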

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

Code

#n = sample size (adults surveyed), x = number who believe college is essential
n = 1031
x = 567

#use prop.test to find p (confidence interval is 95% by default)
prop.test(x, n)


    1-sample proportions test with continuity correction

data:  x out of n, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515

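The point estimate of p (the proportion of adult Americans who believe that college education is essential for success) is 0.5499515, or about 55%. The 95% confidence interval is [0.5189682, 0.5805580]. We are 95% confident that the true proportion lies between about 51.9% and 58.1%, in the sense that intervals built this way capture the true proportion in about 95% of repeated samples.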
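For comparison, the point estimate and a normal-approximation (Wald) interval can be computed by hand. This is a sketch under the usual large-sample assumption; it is a different interval from the one `prop.test()` reports (a Wilson score interval, with a continuity correction by default), so the endpoints differ slightly. The variable names `p_hat`, `se`, `z`, and `wald_ci` are chosen here for illustration.

```r
# Sketch: Wald interval for a proportion, for comparison with prop.test().
n <- 1031          # sample size
x <- 567           # respondents who said college is essential

p_hat <- x / n                              # point estimate
se <- sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error
z <- qnorm(0.975)                           # critical value for 95% confidence
wald_ci <- c(p_hat - z * se, p_hat + z * se)

p_hat
wald_ci

# prop.test() without the continuity correction, for comparison
prop.test(x, n, correct = FALSE)$conf.int
```

All three intervals lie entirely above 0.5, so the conclusion that a majority holds this belief does not depend on which method is used.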

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per semester for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range (in other words, you can assume they know the population standard deviation). Assuming the significance level to be 5%, what should be the size of the sample?

Code

#calculating standard deviation using the given
(200-30)/4

[1] 42.5

Since our significance level is 5%, our confidence level is 95%. A 95% confidence level corresponds to a z-score of 1.96. From here we can calculate the ideal sample size.

Code

#solve 5 = 1.96((200-30)*.25)/sqrt(x)

(((170*.25)/5)*1.96)^2

[1] 277.5556

UMass needs a sample of at least 278 students (277.56 rounded up) to estimate the mean cost of textbooks to within $5 at a 95% confidence level.
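The same calculation can be written with named quantities, so the formula n = (z·σ/E)² is visible and the rounding step is explicit. This is a sketch that assumes, as the question does, that the population standard deviation is known; the variable names are illustrative.

```r
# Sketch: sample size needed for a given margin of error,
# assuming a known population standard deviation.
sigma <- (200 - 30) / 4       # assumed population SD: a quarter of the range
margin <- 5                   # desired margin of error in dollars
alpha <- 0.05                 # significance level

z <- qnorm(1 - alpha / 2)     # critical value, about 1.96
n_required <- ceiling((z * sigma / margin)^2)  # round up to a whole student
n_required
```

Rounding up with `ceiling()` matters here: rounding down would give a margin of error slightly larger than $5.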

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

A) Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

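Assumptions: the sample is random and weekly income is approximately normally distributed; significance level α = 0.05.

H₀: μ = 500

H₁: μ ≠ 500

Test statistic: t = (ȳ − μ₀)/(s/√n) = (410 − 500)/(90/√9) = −3, with df = 8.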

Code

#given
p_mean = 500
s_mean = 410
s_size = 9
s = 90

#find standard error
sem = s/sqrt(s_size)

#find t-score
t_score <- (s_mean - p_mean)/sem
t_score

[1] -3

Code

#find two-sided p-value (abs() keeps this correct if the t-score is positive)
pvalue = 2 * pt(-abs(t_score), df=(s_size - 1))
pvalue

[1] 0.01707168

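Because the p-value (0.017) is less than the significance level (.05), we reject the null hypothesis that μ = 500. The data provide evidence that the mean weekly income of female employees differs from $500.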

B) Report the P-value for H₁ : μ < 500. Interpret.

Code

#calculate the p-value for lower tail only
ltail <- pt(t_score, df=(s_size - 1), lower.tail = TRUE)
ltail

[1] 0.008535841

Because the lower-tail p-value (0.0085) is less than the significance level (.05), we reject H₀, meaning we have evidence that μ < 500.

C) Report and interpret the P-value for H₁: μ > 500.

Code

#calculate the p-value for upper tail only
utail <- pt(t_score, df=(s_size - 1), lower.tail = FALSE)
utail

[1] 0.9914642

Because the upper-tail p-value (0.9915) is far greater than the significance level (.05), we fail to reject H₀, meaning we do not have evidence that μ > 500.

Checking my work:

Code

ltail + utail

[1] 1
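The three p-values in parts A to C come from the same t-score, so they can be gathered in one place. The following is a minimal sketch; `t_test_summary` is a hypothetical helper written for this example, not a base R function.

```r
# Sketch: one-sample t test from summary statistics.
# `t_test_summary` is a hypothetical helper, not part of base R.
t_test_summary <- function(xbar, mu0, s, n) {
  se <- s / sqrt(n)                 # standard error of the mean
  t_stat <- (xbar - mu0) / se       # test statistic
  df <- n - 1                       # degrees of freedom
  list(
    t = t_stat,
    df = df,
    p_two_sided = 2 * pt(-abs(t_stat), df),          # H1: mu != mu0
    p_less = pt(t_stat, df),                         # H1: mu <  mu0
    p_greater = pt(t_stat, df, lower.tail = FALSE)   # H1: mu >  mu0
  )
}

res <- t_test_summary(xbar = 410, mu0 = 500, s = 90, n = 9)
res
```

The two one-sided p-values always sum to 1, and the two-sided p-value is twice the smaller of them, which is the relationship the check above relies on.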

Question 5

Jones and Smith separately conduct studies to test H₀: μ = 500 against H₁ : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

A) Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

Jones

Code

#t-score
Jt_score = (519.5 - 500)/10
Jt_score

[1] 1.95

Code

#p-value
Jp = 2 * pt(Jt_score, df=(1000 - 1), lower.tail = FALSE)
Jp

[1] 0.05145555

Smith

Code

#t-score
St_score = (519.7 - 500)/10
St_score

[1] 1.97

Code

#p-value
Sp = 2 * pt(St_score, df=(1000 - 1), lower.tail = FALSE)
Sp

[1] 0.04911426

B) Using α = 0.05, for each study indicate whether the result is “statistically significant.”

At α = 0.05, Smith’s result is statistically significant (p = 0.049 < α), but Jones’ result is not (p = 0.051 > α).

C) Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.

Jones’ p-value is only just barely greater than 0.05, and Smith’s p-value is only just barely less than 0.05. It is important to report the p-value because studies with very similar samples could report that the null should or should not be rejected, leading to very different conclusions based on data that is extremely similar.
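One way to see how similar the two studies are is to compare their 95% confidence intervals. This is an illustrative sketch using the values given in the question; the variable names are chosen here and are not part of the assignment.

```r
# Sketch: 95% confidence intervals for the two studies.
se <- 10                      # standard error reported for both studies
df <- 1000 - 1                # degrees of freedom, n = 1000
t_crit <- qt(0.975, df)       # critical t value for 95% confidence

jones_ci <- 519.5 + c(-1, 1) * t_crit * se
smith_ci <- 519.7 + c(-1, 1) * t_crit * se
jones_ci
smith_ci
```

The two intervals have the same width and are shifted by only 0.2, yet Jones' interval includes 500 while Smith's just excludes it. Reporting the intervals or the exact p-values makes that near-identity visible in a way a bare "reject" or "do not reject" cannot.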

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

Code

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

Code

t.test(gas_taxes, mu = 45.0, alternative = "less")


    One Sample t-test

data:  gas_taxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278

Since the p-value (0.03827) is less than the significance level of 0.05, we can reject the null hypothesis that the average tax per gallon was greater than or equal to 45 cents. However, we do not know from what year(s) the data were collected. Therefore we cannot conclude with certainty that the average tax per gallon in 2005 was less than 45 cents.
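The t statistic and p-value reported by `t.test()` can be reproduced by hand, which makes the pieces of the test explicit. This is a sketch for checking the output, using the same data; the names `t_stat` and `p_less` are illustrative.

```r
# Sketch: reproduce the one-sided, one-sample t test by hand.
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

n <- length(gas_taxes)                        # sample size
se <- sd(gas_taxes) / sqrt(n)                 # standard error of the mean
t_stat <- (mean(gas_taxes) - 45) / se         # test statistic under H0: mu = 45
p_less <- pt(t_stat, df = n - 1)              # lower-tail p-value (H1: mu < 45)

t_stat
p_less
```

The lower tail is used because the alternative hypothesis is that the mean is less than 45 cents.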
